arXiv:1506.05967v3 [stat.ML] 8 Mar 2016 


Doubly Decomposing 
Nonparametric Tensor Regression 

Masaaki Imaizumi^ and Kohei Hayashi^ 

Tniversity of Tokyo 
^National Institute of Informatics 


Abstract 

Nonparametric extension of tensor regression is proposed. Nonlinearity in a 
high-dimensional tensor space is broken into simple local functions by incorporating 
low-rank tensor decomposition. Compared to naive nonparametric approaches, 
our formulation considerably improves the convergence rate of estimation while 
maintaining consistency with the same function class under specific conditions. To 
estimate local functions, we develop a Bayesian estimator with the Gaussian process 
prior. Experimental results show its theoretical properties and high performance 
in terms of predicting a summary statistic of a real complex network. 


1 Introduction 


Tensor regression deals with matrices or tensors fi.e ., multi-dime i isionp arrays') as covari¬ 
ates ( i nputs) to predict s calar respoiises (o u tputs! IWang et al\ (1201411: iHung and Wand _ 

( 2013 1: Zhao et ^ ( 2014 1: Zhou et al. ( 2013 1: Tomioka et al. ( 2007fl : Suzuki ( 20151) : Guhanivogi et al. 
(1201511 . Suppose we have a set of n observations Dn = U ^ T is a respon¬ 

dent variable in the space T C M and Xj G df is a covariate with iCth-order tensor form 
in the space X C Khx...x7K^ where Ik is the dimensionality of order k. 'With the above 
setting, we consider the regression problem of learning a function f : X ^ y as 


y = f{x,) + u,, 


( 1 ) 


where Ui is zero-mean Gaussian noise with variance a^. Such problems can be found 
in several applications. For example, studies on brain-computer interfaces attempt to 
predict human intentions (e.g., determining whether a subject imagines huger tapping) 
from brain activities. Electroencephalography (EEG) measures brain activities as electric 
signals at several points (channels) on the scalp, giving channel x time matrices as 
covariates. Functional magnetic resonance imaging captures blood how in the brain as 
three-dimensional voxels, giving X-axis x Y-axis x Z-axis x time tensors. 

There are primarily two approaches to the tensor regression problem. One is assuming 
linearity to / as 


f{x,) = {B,xy 


( 2 ) 
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where B G is a weight parameter with the same dimensionalities as X and 

denotes the inner prodnct. Since B is ver y high- 


,X 


31---3K 


(B,x) = 

dimen si onal in gene r al, sev e ral anthors have incorpo ra ted a low-rank stru ct ure to B iDvrholm et al 
( 2007 1: Zhou et al. ( 2013 1: Hung and Wane ( 2013 1: Wang et al. ( 2014 1: Suzuki ( 20151) : 
Guhanivogi et al\ (120151) . We collectively refer to the linear models ([2]) with low-rank B 
as tensor linear repre s sion ( T LR). As an altern ative, a nonparametric approach has been 


proposed Zhao et al. ( 2013ll : Hon et al. ( 2015ll . When f{X) belongs to a proper func¬ 


tional space, with an appropriately choosing kernel function, the nonparametric method 
can estimate / perfectly even if / is nonlinear. 

In terms of both theoretical and practical aspects, the hias-variance tradeoff is a cen¬ 
tral issue. In TLR, the function class that the model can represent is critically restricted 
due to its linearity and the low-rank constraint, implying that the variance error is low 
but the bias error is high if the true function is either nonlinear or full rank. In contrast, 
the nonparametric method can represent a wide range of functions and the bias error can 
be close to zero. However, at the expense of the flexibility, the variance error will be high 
due to the high dimensionality, the notorious nature of tensors. Generally, the optimal 
convergence rate of nonparametric models is given by 


Q(^-/3/(2/3+d)^^ 


( 3 ) 


wh ich is dom i nated by the input dimensionality d and the smoothness of the true function 
f (iTsvbakovl . 120081) . For tensor regression, d is the total number of X’s elements, i.e., 
rifc When each dimensionality is roughly the same as R ~ ~ Jx, d = 0(/f^), which 

signihcantly worsens the rate, and hinders application to even moderate-sized problems. 

In this paper, to overcome the curse of dimensionality, we propose additive-multiplicative 
nonparametric regression (AMNR), a new class of nonparametric tensor regression. In¬ 
tuitively, AMNR constructs / as the sum of local functions taking the component of a 
rank-one tensor as inputs. In this approach, functional space and the input space are 
concurrently decomposed. This “double decomposition” simultaneously reduces model 
complexity and the effect of noise. For estimation, we propose a Bayes estimator with 
the Gaussian Process (GP) prior. The following theoretical results highlight the desirable 
properties of AMNR. Under some conditions, 

• AMNR represents the same function class as the general nonparametric model, 
while 

• the convergence rate ([3]) is improved as d = Ik' {k' = argmax^ J^), which is 
times better. 

We verify the theoretical convergence rate by simulation and demonstrate the empirical 
performance for real application in network science. 


2 AMNR: Additive-Multiplicative Nonparametric Re¬ 
gression 

First, we introduce the basic notion of tensor decomposit ion. With a hnite positive integer 
R*, th e CANDECOMP/PARAFAC (CP) decomposition Harshman ( 1970 1: Garroll and Ghang 
( 1970 1 of X G Th is dehned as 


R* 


.y = 5^a 


.^(1) 


® ® . . .®X. 


(K) 


( 4 ) 


r=l 
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where 0 denotes the tensor product, G is a unit vector in a set := {n|n G 
||n|| = 1}, and is the scale of ..., xi^^} satisfying > A,.' for all r > r'. In 
this paper, R* is the rank of X. 

A similar relation holds for functions. Here, W^(A’) denotes a Sobolev space, which 
is P times differentiable functions with support X. Let g G W^(S') be such a function. If 
S is given by the direct product of multiple supports as S' = S'! x ••• x S'j, there exists a 
(possibly inhnite) set of local functions {gm G yV^{Sj)}m satisfying 




(5) 


for any g ( Hackbuschl . 2012 . Example 4.40). This relation can be seen as an extension of 
tensor decomposition with inhnite dimensionalities. 


2.1 The Model 

For brevity, we start with the case wherein X is rank one. Let X = ^f^Xk := Xi ^ 
• • • <8) Xk with vectors {xk G and / G W^((3)fc X^^'>) be a function on a rank one 

tensor. For any /, we can construct f{xi, ... ,xk) G x ... x such that 

f{xi,..., Xk) = f{X) using function composition as f = f o h with h : (xi,..., Xk) •—t 
(S). Xk- Then, using ([S]), / is decomposed into a set of local functions {/m^ G 
as: 

M* K 

f{X) = f{x„ ,xk) = J21[ (6) 

m=l k=l 

where M* represents the complexity of / (i.e., the “rank” of the model). 

With CP decomposition, (jH]) is amenable to extend for X ^ X having a higher rank. 
For i?* > 1, we dehne AMNR as follows: 

M* R* K 

/“'™(A') := E E If (7) 

m=l r=l k=l 

Aside from the summation with respect to m, AMNR ([7]) is very similar to CP decompo¬ 
sition dH) in terms of that it takes summation over ranks and multiplication over orders. 
In addition, as A^ indicates the importance of component r in CP decomposition, it con¬ 
trols how component r contributes to the hnal output in AMNR. Note that, for R* > 1, 
equality between jAmnr j ^ W^i^X) does not hold in general; see Section |U 


3 Truncated GP Estimator 

3.1 Truncation of M* and R* 

To construct AMNR ([7]), we must know M*. However, this is unrealistic because we do 
not know the true function. More crucially, M* can be inhnite, and in such a case the 
exact estimation is computationally infeasible. We avoid these problems using predehned 
M < oo rather than M* and ignore the contribution from {/m^ : rn > M}. This may 
increase the model bias; however, it decreases the variance of estimation. We discuss how 
to determine M in Section IT2l 
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For R*, we adopt the same strategy as M*, i.e., we prepare some R < R* and approx¬ 
imate X as a rank- R tensor. Because this approximation reduces some information in 
X, the prediction performance may degrade. However, if R is not too small, this prepro¬ 
cessing is justihable for the following reasons. First, this approximation possibly removes 
the noise in X. In real data such as EEG data, X often includes observation noise that 
hinders the prediction performance. However, if the power of the noise is sufficiently 
small, the low-rank approximation discards the noise as the residual and enhances the 
robustness of the model. In addition, even if the approximation discards some intrin¬ 
sic information of X, its negative effects could be limited because As of the discarded 
components are also small. 

3.2 Estimation method and algorithm 

For each local function fm\ consider the GP prior GP{fm^), which is represented as 
multivariate Gaussian distribution J\f{0Rn, Km'^) where is the zero element vector of 
size Rn and Km is a kernel Gram matrix of size Rn x Rn. The prior distribution of the 
local functions ^ := {fm^}m,k is then given by: 

M K 

Ad) = n n GPif!!:')- 

m=l k=l 

From the prior 7r(^) and the likelihood ]dj N(Yi\f{Xi), a^), Bayes’ rule yields the posterior 
distribution: 

T^mDn) 

where = Y,m=iJ2r=i ^r,iYlk=i ^ = {fm^}m,k are dummy variables 

for the integral. We use the posterior mean as the Bayesian estimator of AMNR: 

/ MR K 

^ n ft^dnmDn)d^. (9) 

m=l r=l k=l 

To obtain predictions with new inputs, we derive the mean of the predictive distribu¬ 
tion in a similar manner. 

Since the integrals in the above derivations have no analytical solution, we compute 
them numerically by sampling. The details of the entire procedure are summarized as 
follows. Note that Q denotes the number of random samples. 

• Step 1: CP decomposition of input tensors 

With the dataset apply rank-i? GP decomposition to Xj and obtain and 

{xfj} for f = 1,... ,77,. 

• Step 2: Construction of the GP prior distribution 

Construct a kernel Gram matrix Km'^ from for each m and k, and obtain 

random samples of the multivariate Gaussian distribution Km^). For each 

sampling q = 1,... ,Q, obtain a value for each r, m, k, and i = 1,. ■ ■ ,n. 
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• Step 3: Computation of likelihood 

To obtain the likelihood, calculate Ilfc sampling q and 

obtain the distribution by ([ 8 ]). Obtain the Bayesian estimator / and select the 
hyperparameters (optional). 

• Step 4: Prediction with the predictive distribution 

Given a new input X\ compute CP decomposition and obtain \'r and 
Then, sample from the prior for each q. By multiplying the likelihood 

calculated in Step 3, derive the predictive distribution of Hfc 

and obtain its expectation with respect to q. 

Remark Although CP decomposition is not unique up to sign permutation, our model 
estimation is not affected by this. For example, tensor X with R* = 1 and K = 3 has 
two equivalent decompositions: (A) xi ^X 2 ^X 3 and (B) xi ® {—X 2 ) ® {—X 3 ). If training 
data only contains pattern (A), prediction for pattern (B) does not make sense. However, 
such a case is pathological. Indeed, if necessary, we can completely avoid the problem 
by hipping the sign of xi,X 2 , and 0:3 at random while maintaining the original sign of X. 
Although the sign hipping decreases the effective sample size, it is absorbed as a constant 
term and the convergence rate is not affected. 

4 Theoretical Analysis 

Our main interest here is the asymptotic behavior of distance between the true function 
that generates data and an estimator ([2]). Preliminarily, let /° G W^(A’) be the true 
function and /„ be the estimator of /°. To analyze the distance in more depth, we 
introduce the notion of rank additivit'^ for functions, which is assumed implicitly when 
we extend ([ 6 ]) to o. 

Definition 1 (Rank Additivity). A function f \ X is rank additive if 



R* 




where Xr := XrX^r'^ ® ® xi^\ 

Letting f* be a projection of /° onto the Sobolev space / G satisfying rank 
additivity, the distance is bounded above as 

ii/"-/„ii<ii/“-/*ii + iir-/„ii. (10) 


Unfortunately, the first term ||/° —/*|| is difficult to evaluate, aside from a few exceptions; 
if R* > 0 or /° is rank additive, ||/° — /*|1 = 0 . 

Therefore, we focus on the rest term ||/* — fn\\- By definition, f* is rank additive and 
the functional tensor decomposition ([ 6 |) guarantees that f* is decomposed as the AMNR 
form d?]) with some M*. Here, the behavior of the distance strongly depends on M*. We 
consider the following two cases: (i) M* is finite and (ii) M* is infinite. In case (i), the 


^This type of additivi t y is often assumed in multivariate and additive model analysis 


Hastie and Tibshiranil (: iRavikumar et ol. ( 2009ll . 
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Mere. 




VAMNRwith small M*. 

^ AMNR with large M* _y 

Sobolev space (rank additive) 


Figure 1: Functional space and the effect of M*. 

consistency of /„ to f* is shown with an explicit convergence rate (Theorem [T]). More 
surprisingly, the consistency also holds in case (ii) with a mild assumption (Theorem [2]). 

Figure [U illustrates the relations of these functions and the functional space. The 
rectangular areas are the classes of functions represented by AMNR with small M*, 
AMNR with large M*, and Sobolev space with rank additivity. 

Note that the formal assumptions and proofs of this section are shown in supplemen¬ 
tary material. 


4.1 Estimation with Finite M* 


The consist ency of Bayes i an nonparametric estimators is evaluat e d in terms of posterior _ 

consis tency Ghosal et al. ( 2000 ): Ghosal and van der Vaar'O ( 20071) : van der Vaart and van Zanten 
(I 2 OO 8 I) . Here, we follow the same strategy. Let ||/|1^ := ^ X]r=iempirical 
norm. We define as the contraction rate of the estimator of local function which 
evaluates the probability mass of the GP around the true function. Note that the order 

ik') 

of Cn' depends on the covariance kernel function of the GP prior, in which the optimal 
rate of is given by ([3]) with d = Ik- For brevity, we suppose that the variance of the 
noise Ui is known and the kernel in the GP prior is selected to be optimalU Then, we 
obtain the following result. 

Theorem 1 (Gonvergence analysis). Let M = M* < 00 . Then, with AssumptionUl (ind 
some finite constant G > 0, 

^ 11 /^ _ f*\\l < (^^-2/3/(2/3+maxfe4)_ 

Theorem [T] claims the validity of the estimator ([9]). Its convergence rate corresponds 
to the minimax optimal rate of estimating a function in on compact support in M^'=, 
showing that the convergence rate of AMNR depends only on the largest dimensionality 
of X. 


4.2 Estimation with Infinite M* 


When M* is infinite, we cannot use the same strategy used in Section 14.11 Instead, 
we truncate M* by finite M and evaluate the bias error caused by the truncation. To 


^We assume that the Matern kernel is selected and the weight of the kernel is equal to /3- Under these 
conditions, the optimal rate is achieved (|Tsvbakov . 
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evaluate the bias, we assume that the local functions are in descending order of their 
volumes fm := {fm} are ordered _as satisfying ||/m'||2 > Wfmh for all 

m' > m. We then introduce the assumption that ||/m|| decays to zero polynomially with 
respect to m. 

Assumption 1. With some constant 7 > 0, 

Wfmh = O (m"^) , 


as m ^ 00 . 

This assumption is sufficiently limited. For example, if we pick {/m^} as =1 

are orthogonal to each then the Parseval-type inequality leads to = 

II Em/mlh = II/II2 < 00, implying ||/^||2 0 as m 00. 

Then we claim the main result in this section. 


Theorem 2. Suppose we construct the estimator Q with 

M X (^7i“W(2^+maxfc4)^7/(l+7) 


where x denotes equality up to a constant. Then, with some finite constant C > 0, 


E\\fn — /*|1^ < (7('^-2^/(2/3+maxfe4)^7/(l+7) 


The above theorem states that, even if we truncate M* by hnite M, the convergence 
rate is nearly the same as the case of hnite M* (Theorem [T]), which is slightly worsened 
by the factor 7/(1 + 7 ). 

Theorem [2] also suggests how to determine M. For example, if 7 = 2,/5 = 1, and 
maxfc/fc = 100, M x is recommended, which is much smaller than the sample size. 
Our experimental results (Section 16.31) also support this. Practically, very small M is 
sufficient, such as 1 or 2, even if n is greater than 300. 

Here, we show the conditional consistency of AMNR, which is directly derived from 
Theorem [2l 

Corollary 1. For all function f* G with finite M = M* or Assumption [T], the 
estimator ([9]) satisfies 

EWfn-fYn^O, 


as n —)> 00. 


5 Related Work 


5.1 Nonparametric Tensor Regression 

The tensor GP (TGP) f Zhao et a,l. . 20141: Hon et aZI. 2015 ) in a method that estimates 


the function in >V^(iS) directly. TGP is essentially a GP regression m odel that Fa t tens a 


tensor into a high-dimensional vector and takes the vector as an input. IZhao et a,l\ fj2014l) 


proposed i t s esti mator and applied it to image recognition from a monitoring camera. 
Hon et al\ fj2015l) applied the method to analyze brain signals. Although both studies 


^For example, we can obtain such {/m^} by the GramSchmidt process. 
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demonstrated the high performance of TGP, its theoretical aspects such as convergence 
ha ve not been discu s sed. 

Signoretto et al. f 2ni3h proposed a regression model with tensor product reproducing 
kernel Hilbert spaces (TP-RKHSs). Given a set of vectors {xk}k=ii their model is written 
as 


( 11 ) 

j k 

where aj is a weight. The key difference between TP-RKHSs flTT]) and AMNR is in 
the input. TP-RKHSs take only a single vector for each order, meaning that the input 
is implicitly assumed as rank one. On the other hand, AMNR takes rank-i? tensors 
where R can be greater than one. This difference allows AMNR to be used for more 
general purposes, because the tensor rank observed in the real world is mostly greater 
than one. Furthermore, the properties of the estimator, such as convergence, have not 
been investigated. 


5.2 TLR 


For the matrix case {K = 2), iDvrholm et al\ (1200711 proposed a cl assihcation model as 


o. where B is assumed to be low rank. 
regres sion where the expectation is given by 


Hung and Wangl (1201311 propose d a logistic 


and R is a rank-one matrix. Zhou et al. 


( 2013 1 extended these concepts for tensor inputs. ISuzukil (1201511 and iGuhanivogi et al 


(1201511 proposed a Bayes estimator of TLR and investigated its convergence rate. 

Interestingly, AMNR is interpretable as a piecewise nonparametrization of TLR. Sup¬ 
pose B and X have rank-M and rank-R GP decompositions, respectively. The inner 
product in the tensor space in ([2]) is then rewritten as the product of the inner product 
in the low-dimensional vector space, i.e.. 


M R 


K 


{B, n 




T 

m 5 '^r^i 


). 


( 12 ) 


m=l r=l 


k=l 


(k) 

where bm is the order-A" decomposed vector of B. The AMNR formation is obtained by 
replacing the inner product {bm\ with local function 

From this perspective, we see that AMNR incorporates the advantages of TLR and 
TGP. AMNR captures nonlinear relations between Y and X through fm\ which is im¬ 
possible for TLR due to its linearity. Nevertheless, in contrast to TGP, an input of the 
function constructed in a nonparametric way is given by an R-dimensional vector rather 
than an {R ,■■■, Ik) -dimensional tensor. This reduces the dimension of the function’s 
support and signihcantly improves the convergence rate (Section 0]). 


5.3 Other studies 


Koltchinskii et al\ (1201011 and lSuzuki et a,l\ (1201311 investigated Multiple Kernel Learning 


(MKL) considering a nonparametric p-variate regression model with an additive struc¬ 
ture: Yl^j=i fji^j)- handle high dimensional inputs, MKL reduces the input dimen¬ 
sionality by the additive structure for fj and Xj. Note that both studies deal with a 
vector input, and they do not ht to tensor regression analysis. 








































Table 1: Comparison of related methods. 


L Method 

Tensor 

Input 

Non- Convergence 

LINEARITY RATE WITH ([3]) 

TLR 

V 


d = 0 

TGP 

V 

V 

d = nkik 

TP-RKHSs 

RANK-1 

V 

N/A 

MKL 


V 

N/A 

AMNR 

V 

V 

d = max Ik 


TLR AMNR T f TGP 



(a) Full view (b) Enlarged view Full view (b) Enlarged view 

Figure 2: Synthetic data experiment: Low- Figure 3: Synthetic data experiment: Full- 
rank data. rank data. 


Table [T] summarizes the methods introduced in this section. As shown, MKL and 
TP-RKHSs are not applicable for general tensor input. In contrast, TLR, TGP, and 
AMNR can take multi-rank tensor data as inputs, and their applicability is much wider. 
Among the three methods, AMNR is only the one that manages nonlinearity and avoids 
the curse of dimensionality on tensors. 


6 Experiments 

6.1 Synthetic Data 

We compare the prediction performance of three models: TLR, TGP, and AMNR. In all 
experiments, we generate datasets by the data generating process (dgp) as R = f*{X)+u 
and hx the noise variance as = 1. We set the size of X G i.e., K = 2 and 

h = h = 20. By varying the sample size as n G {100,200,300,400,500}, we evaluate 
the empirical risks by the mean-squared-error (MSE) for the testing data, for which we 
use one-half of the samples. For each experiment, we derive the mean and variance of 
the MSEs in 100 trials. For TGP and AMNR, we optimize the bandwidth of the kernel 
function by grid search in the training phase. 
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♦ 4 TLR 


AMNR T- •» TGP 



R 


fa) Full view 



(b) Enlarged view 



Figure 4: Synthetic data experiment: Sensitivity of R. 


Figure 5: Synthetic data exper¬ 
iment: Sensitivity of M. 


6.1.1 Low-Rank Data 

First, we consider the case that X and the dgp are exactly low rank. We set R*, the true 
rank of X, a.s R* = 4 and 




K 


r(^) = + ^ 


r=l 


k=l 


where [ 7 ]^ = O.lj. The results (Figure [2]) show that AMNR and TGP clearly outperform 
TLR, implying that they successfully capture the nonlinearity of the true function. To 
closely examine the difference between AMNR and TGP, we enlarge the corresponding 
part (Figure [2](b)), which shows that AMNR consistently outperforms TGP. Note that 
the performance of TGP improves gradually as n increases, implying that the sample size 
is insufficient for TGP due to its slow convergence rate. 

6.1.2 Full-Rank Data 

Next, we consider the case that X is full rank and the dgp has no low-rank structure, i.e., 
model misspecihcation will occur in TLR and AMNR. We generate X as Xj^j^ AA(0,1) 
with 

K 

k=l k 

The results (Figure [3]) show that, as in the previous experiment, AMNR and TGP out¬ 
perform TLR. Although the difference between AMNR and TGP (Figure [Hlfb)) is much 
smaller, AMNR still outperforms TGP. This implies that, while the effect of AMNR’s 
model misspecihcation is not negligible, TGP’s slow convergence rate is more problematic. 


6.2 Sensitivity of Hyperparameters 

Here, we investigate how the truncation of R* and M* affect prediction performance. In 
the following experiments, we hx the sample size as n = 300. 

First, we investigate the sensitivity of R. We use the same low-rank dgp used in 
Section 15.1.11 (i.e., R* = 4.) The results (Figure 0]) show that AMNR and TGP clearly 
outperform TLR. Although their performance is close, AMNR beats TGP when R is not 
too large, implying that the negative effect of truncating R* is limited. 
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Next, we investigate the sensitivity of M. We use the same full-rank dgp used in 
Section [6.1.21 Figure [5] compares the training and testing MSEs of AMNR, showing that 
both errors increase as M increases. These results imply that the model bias decreases 
quickly and estimation error is more dominant. Indeed, the lowest testing MSE is achieved 
at M = 2. This agrees satisfactory with the analysis in Section 14.21 which recommends 
small M. 

6.3 Convergence Rate 

Here, we conhrm how the empirical convergence rates of AMNR and TGP meet the 
theoretical convergence rates. To clarify the relation, we generate synthetic data from 
dgp with /9 = 1 such that the difference between TGP and AMNR is maximized. To do 
so, we design the dgp function as f*{X) = Il^i and 

i 


where (j)i{z) = \/2cos((/ — \I‘I)'kz') is an orthonormal basis function of the functional 
space and sin(/) jf| 


Setting 

No. 

K 

R* 

h 

h 

h 

d IN dSj) 

TGP AMNR 

(I) 

3 

2 

10 

10 

10 

1000 

10 

(n) 

3 

2 

3 

3 

3 

27 

3 

(III) 

3 

2 

10 

3 

3 

90 

10 


Table 2: Synthetic data experiment: Settings for convergence rate. 

For X we consider three variations: 3x3x3, lOx, 3 x 3, and 10 x 10 x 10 (Tabled]). 
Figure |6] shows testing MSEs averaged over 100 trials. The theoretical convergence rates 
are also depicted by the dashed line (TGP) and the solid line (AMNR). To align the 
theoretical and empirical rates, we adjust them at n = 50. The result demonstrates the 
theoretical rates agree with the practical performance. 



LU 

1/1 


O 



(I) 10 X 10 X 10 
tensor. 


(II) 3x3x3 
tensor. 


(Ill) 10 X 3 X 3 
tensor. 


TLR ~1 


Figure 6: Gomparison of convergence rate. 


^This dgp is derived from a theory of Sobolev ellipsoid; see iTsvbakovI (|2008ll . 
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6.4 Prediction of Epidemic Spreading 


Here, w e deal with t h e epid emic spreading problem in complex networks I Anderson and Mav 
fll99ll) : IVespignanil (120121) as a matrix regression problem. More precisely, given an 
adjacency matrix network Xj, we simulate the spreading process of a disease by the 
susceptible-infected-recovered (SIR) model as follows. 


1. We select 10 nodes as the initially infected nodes. 

2. The nodes adjacent to the infected nodes become infected with probability 0.01. 

3. Repeat Step 2. After 10 epochs, the infected nodes recover and are no longer 
infected (one iteration = one epoch). 


After the convergence of the above process, we count the total number of infected nodes as 
Yj. Note that the number of infected nodes depends strongly on the network structure and 
its prediction is not trivial. Conducting the simulation is of course a reliable approach; 
however, it is time-consuming, especially for large-scale networks. In contrast, once we 
obtain a trained model, regression methods make pred iction very quick. 

As a real network, we use the Enron email dataset iKlimt and Yand (120041) . which is 
a collection of emails. We consider these emails as undirected links between senders and 
recipients (i.e., this is a problem of estimating the number of email addresses infected by 
an email virus). First, to reduce the network size, we select the top 1, 000 email addresses 
based on frequency and delete emails sent to and received from other addresses. After 
sorting the remaining emails by timestamp, we sequentially construct an adjacency matrix 
from every 2, 000 emails, and we hnally obtain 220 input matrices. 

For the analysis, we set i? = 2 for the AMNR and TLR methodsH Although R = 2 
seems small, we can still use the top-two eigenvalues and eigenvectors, which contain a 
large amount of information about the original tensor. In addition, the top eigenvector s 
are closely related to the threshold of outbreaks in infection networks IWang et al\ (120031) . 
From these perspectives, the good performance demonstrated by AMNR with R = 2 
is reasonable. The bandwidth of the kernel is optimized by grid search in the training 
phase. 

Figure [7] shows the training and testing MSEs. Firstly, there is a huge performance 
gap between TLR and the nonparametric models in the testing error. This indicates that 
the relation between epidemic spreading and a graph structure is nonlinear and the linear 
model is dehcient for this problem. Secondly, AMNR outperforms TGP for every n in 
both training and testing errors. In addition, the performance of AMNR is constantly 
good and almost unaffected by n. This suggests that the problem has some extrinsic 
information that inflates the dimensionality, and the efficiency of TGP is diminished. On 
the other hand, it seems AMNR successfully captures the intrinsic information in a low¬ 
dimensional space by its “double decomposition” so that AMNR achieves the low-bias 
and low-variance estimation. 


7 Conclusion and Discussion 

We have proposed AMNR, a new nonparametric model for the tensor regression problem. 
We constructed the regression function as the sum-product of the local functions, and 

®We also tested i? = 1, 2,4, and 8; however, the results were nearly the same. 
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Figure 7: Epidemic spreading experiment: Prediction performance. 


developed an estimation method of AMNR. We have verihed our theoretical analysis and 
demonstrated that AMNR was the most accurate method for predicting the spread of an 
epidemic in a real network. 

The most important limitation of AMNR is the computational complexity, which is 
better than TGP but worse than TLR. The time complexity of AMNR with the GP esti¬ 


mator is 0{nRY\^ Ik+M{nR)^ where GP decomposition requires 0{RY\^ R nPhan et al 

(120131) for n inputs and the GP prior requires 0{Ik{nR)^) for the MK local func¬ 
tions. On the contrary, TGP requires 0{n^Y\k^k) computation because it must eval¬ 
uate all the elements of X to construct the kernel Gram matrix. When R and 

MR^J2k^k -C Ylk^k, which are satished in many practical situations, the proposed 
method is more efficient than TGP. 

Approximation methods fo r GP regression can be used to reduce the computational 
burden of AMNR. For example, Williams and Seeger f 200ll) proposed the Nystrom method, 
which approximates the kernel Gram matrix by a low-rank matrix. If we apply rank-L 
approximation, the computational cost of AMNR can be reduced to 0{{L^ + nL‘^) R). 
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A Proof of Theorem [T] 


Here, we describe the detail and proof of Theorem 1. At the beginning, we introdnce a 
general theory for evalnating the convergence of a Bayesian estimator. 

Preliminary, we introdnce some theorems from previous studies. Let Pq be a true 
distribution of X, K{f,g) be the Kullback-Leibler divergence, and dehne V{f,g) = 
J{\og{f/g))‘^fdx. Let d be the Hellinger distance, N{e,V,d) be the bracketing number, 
and D{e,V,d) be the packing number. Also we consider a reproducing kernel Hilbert 
space (RKHS), which is a closure of linear space spanned by a kernel function. Denote 
by the RKHS on 

The following theorem provides a novel tool to evaluate the Bayesian estimator by 
posterior contraction. 


Theorem 3 (Theorem 2.1 in Ghosal et al. ( 20001) 1. Consider a posterior distribution 
Iln{-\Dn) on a set V. Let e„ he a sequence such that e„ —)■ 0 and ne\ —)■ oo. Suppose that, 
for a constant G > 0 and sets Vn C V, we have 


1. \ogD{en,Vn,d) < nel, 

2. Ur,{Vn\V)<exp{-nel{C + 4)), 

3. Un{p : -K{p,po) < el,V(p,po) < el) > exp{-Cnel). 


Then, for sufficiently large C, EYin{P '■ d{P, Pq) > G'e„)|D„) —)■ 0. 

Based on the theorem, van der Vaart and van Zanten f 2008h provide a more useful 
result for the Bayesian estimator with the GP prior. They consider the estimator for an 
inhnite dimensional parameter with the GP prior and investigated the posterior contrac¬ 
tion of the estimator with GP. They provide the following conditions. 


Condition (A) With some Banach space {B, || • ||) and RKHS (fH, || ■ ||), 

1 . log N{en, Bn, II • II) < Cnel, 

2 . Ft{W i Bn) < exp{-Cnel), 

3. Pr (||1K — tcoll < 2en) > exp(—Gne^), 


where hP is a random element in B and wq is a true function in support of W. 

When the estimator with the GP prior satished the a bove conditions, the posterior 
contraction of the estimator is obtained as Theorem 2.1 in Ghosal et al. iii^. 


Based on this, we obtain the following result. Consider a set of GP {{Wm}x '■ x G 
and let {F^ : x E X} he a stochastic process that satishes = 

f f L ^ JLi y' 

Also we assume that there exists a true function /q which is constituted by a unique 
set of local functions {tCm o}m=l,.■■,M,A;=l,...,ic• 

To describe posterior contraction, we dehne a contraction rate It converges to 
zero as n —?■ oo. Let (f)^^\e) be a concentration function such that 


0 ^^^(e) := inf 

-.Wh—WQWKt 


2 


logPr(||W(^)|| < 
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where || • ||'^(fc) is the norm induced by the inner product of RKHS. We dehne the con¬ 
traction rate with We denote a sequence {en'’}n,k satisfying 


The order of en^ depends on a choice of ke rnel function, where the optimal minimax rate 


is en'^ = 0{n l^k/Wk+hJ'j Tsvbakov (2008). In the following part, we set = e„ for 


every k. As k' = argmax^A, the satishes the condition about the concentration 
function. 

We also note the relation between posterior contraction and the well-known risk 
bound. Suppose the posterior contraction such that EYlniO : d^(0,6*o) > Ce^\Dn) —)• 0 
holds, where 6^ is a parameter, 6q is a true value, and dn is a bounded metric. Then we 
obtain the following inequality: 


EIlJdl{e.ea)\D„) < Cel 


I n ZT-TT (a 


79 ^ 


where D is a bound of dn- This leads EIln{dl{6,6o)\Dn) = O(C'e^). In addition, if 
9 1-^ dl{9, 9q) is convex, the Jensen’s inequality provides dl{9, 9q) < Iln{dl{9, 9o)\Dn). By 
taking the expectation, we obtain 

Edl{e,eo) < EUn{dl{e,eo)\Dn) = o{Cel). 


We start to prove Theorem [T] First, we provide a lemma for functional decomposition. 
When the function has a form of a fc-product of local functions, we bound a distance 
between two functions with a /c-sum of distance by the local functions. 

Lemma 1. Suppose that two functions f,g : —)■ M have a form f = n. fk and 

g = Yikdk with local functions fk,gk : Afc —)■ Then we have a bound such that 

11 / - ^11 < 5 ]] Wfk - 9k\\ max{||/fc|| , ll^fcll} . 

k 

Proof. We show the result based on induction. When k = 2, we have 


f ~ 9 — / 1/2 — 9i92 — /i (/2 — 92 ) + (/i — 9i)92) 


and 


11/ - ^11 < ll/ill 11/2 - 92\\ + ll/i - ^ill 11^2 


Thus the result holds when k = 2. 

Assume the result holds when k = k'. Let k = k' + 1. The difference between 
k' -\- 1-product functions is written as 

f - 9 = fk'+l n /fc ~ 9k’+l n 5'A: = fkifk'+l - 9k'+l) + 9k)9k'+l- 

h' h' h' h' h' 

Ay rb rb Ay Ay 

From this, we obtain the bound 


ll/-9ll< 


n* 


ll/fc'+i 


9k'+l 


Ylfk-Yl9k 


\\9k'+i\\ ■ 


The distance II fk — rifc'S'fcll is decomposed recursively by the case of k = k'. Then 
we obtain the result. □ 
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Now we provide the proof of Theorem [T] Note that M = M* < cxo in the statement 
in Theorem [TJ Asume that C, C, C",... are some positive hnite constants and they are 
not affected by other values. 

Proof. We will show that satishes the condition (A). Firstly, we check the third con¬ 
dition in the theorem. 

According to Lemma [H the value WF^ — /o|| is bounded as 


in-/oli<EE 


JJwf - JJw 


(fc) 

m,0 


- “ '^Soii n 


m r k 


k'^k 


W. 


(k') 

m,0 


pp(fc') 

' ' 'm 


By denoting 
/oil < Cn) as 




:= max 


w. 


(k') 

m,0 




, we evaluate the probability Pr(||F — 


Pr(||P-/o|| <e0 >Pr feEEll '^m,0 11 11 P/n 

\ m r k k'^k 


\ r km 


k'^k 

1 


< e. 


C. 


(13) 


where Ck is a positive hnite constant satisfying Ck = max^ Ha 




From Ivan der Vaart and van ZantenI (120081) , we use the following inequality for every 
Gaussian random element W: 

Pr (||fF - woll < €n) > exp(-ne^), 

Then, by seting e„ = ^ ^ and with some constant C, we bound ([T5]) below as 


p-- (eeeIpp®- 

\ r km 

> n pr ( vs' - <! 


,(fc) 

m,0 


m,r^k 


A) 


m^r,k 


- Ck "' 


(k) 


> 


nnn“p\ « 


m r k 


n 


(k),2 


> exp f-n ^ . 

\ m,r,k / 


For the second condition, we dehne a subspace of the Banach space as 

for all fc = 1,..., iP. Note and are unit balls in B and PL. Also, we dehne Bn as 


Bn := Iw : w = Wfc G Bll'\\/k 
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As shown in Ivan der Vaart and van ZantenI (120081) . for every r and k, 

Pr i -Sf) < 1 - 4(0?’ + Mf’). 


where $ is the cnmnlative distribntion fnnction of the standard Gaussian distribution; 

(k) ik') 

an and MA satisfy the following equation with a constant C > 0 as 

Mf> = - 2 #-'(exp(-C'n(£i'=y)). 


By setting ait'’ + and nsing the relation d'o(t) < ne^, we have 

Pr (Wifi ^ B<*)) <1-4 Qmw) = exp(-C'n(€!f>)^). 

This leads 

P'liF, i B„) < nnifpi'tw"™' ^ -Bi"’) 

m k r 

snnn<=i‘p(-c'"('«’)t 

m k r 

= exp(-G' 

m,r,k 


Finally, we show the hrst condition. Let be a set of elements of 

for all k. Also, we set that each are separ ated, thus en'^ balls with center d o 
not have intersections. According to Section 5 in Ivan der Vaart and van ZantenI (120081) . 
we have 


i>5^Pr(iyif)ehf+ 6^5^) 

i=i 

j\fW 

i=i 

> AfWexp exp(-4(4‘’’))- 

Consider 2e^f^-nets with center The nets cover we obtain 

Af(24‘),Afi‘)W<‘>, II . II) < IV^I < exp ClAfi*’’)") exp(#>(€lf))). 

Because every point in is within en^ from some point of we have 

iV(34‘>, Bf, II ■ II) < 1V(24‘>, AfWwfl, II . II). 
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By Lemma m for every elements w^w' G constructed as w = G Bn\ 

its distance is evaluated as 


\w — w \\ = 




< 


E 


\w 


W _ yj'W 


n 

k'^k 


\W 


(k') 


k'^k k 


\yjik) _ yj>(k)\ 


(14) 


We consider a set {h* : h* = Y\.kj 'which are the element of Bn- According to dH]), 
the Cen-aet with center {h*} will cover Bn, and its number is equal to Hfc A^(^n, Bn \ || • ||). 
Let =• ^nd we have 


logiV(36;,5., II ■ II) < logiV(3e(f),5f, || • ||) 

m,r,k 

< J]log]V(26!f),MWw<‘’>,|l.||) 

m,r,k 

< E (^(vW)" + #>(4‘>)) 

m,r,k ^ ^ 

< 5 ; (C"n(4‘>)= + C"'n(4*))=) 

m,r,k 


The last inequality is from the dehnition of and (/)^^^(e„). 

We check that the conditions (A) are all satished, thus we obtain posterior contraction 

(k) 

of the GP estimator with rate eh . Also, according to the connection between posterior 
contraction and the risk bound, we achieve the result of Theorem [T] 

□ 


B Proof of Theorem [2] 

We dehne the representation for the true function. Recall the notation / = J2m fm- We 
introduce the notation for the true function f* and the GP estimator / as follow: 

r 

fn 


m=l 

M 

= E 

m=l 


18 








We decompose the above two functions as follows: 


iir 


fWn = 


M 


m=l m=l 

oo M 


< 


oo 

E R 


m=l 


+ 


M 




m=l 


( 15 ) 


Consider the hrst term with Assumption 1. The expectation of the hrst term of ffT5|) 
is bounded by 


< 


E 

m=M-\-l 

oo 

c E ^ 

m=M+l 


< E II/. 

m=M+l 


* 

m \\2 


-7 


with hnite constant C. The exchangeability of the hrst inequality is guaranteed by the 
setting of /*. Also, the second inequality comes from Assumption 1. Then, the inhnite 
summation of m~'^ enables us to obtain that the expectation of the hrst term is 0{M~'^), 
by using the relation ^ 

About the second term of flThl) . we consider the estimation for f* with hnite M. As 
shown in the proof of Theorem [T], the estimation of fm is evaluated as 

( M 

E” 

m 

Finally, we obtain the relation 


2/3 + maxj, Ij, j _ Q I ][{fl 2/3 + maxj, 


E\\f* - f\\n = 0{M-^) + O (^Mn . 

Then, we allow M to increase as n increases. Let M n‘^ with positive constant (, and 
simple calculation concludes that ( = + 7) is optimal. By substituting 

we obtain the result. 
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